Method and apparatus for performing bilingual word alignment

ABSTRACT

The invention discloses a method and apparatus for performing bilingual word alignment, relating to the field of text information processing. The method comprises the following steps: preprocessing the source and the translated (target) text of the bilingual documents to be aligned; computing the probability of a link (i.e. a corresponding relation) between each source word and each target word; setting the initial word alignment to an empty alignment; iteratively applying a greedy algorithm to search for word alignments that are permitted by the inversion transduction grammar constraint; and outputting the best word alignment among the alignments that satisfy the inversion transduction grammar constraint as the final alignment result. The apparatus comprises: a preprocessing module, a probability computation module, an initial word alignment generation module, a word alignment search module, and an alignment result output module. By applying the greedy algorithm iteratively under the inversion transduction grammar constraint, the present invention improves both alignment speed and alignment robustness.

TECHNICAL FIELD

The present invention relates to the field of information processing, and in particular to a method and an apparatus for performing bilingual word alignment.

BACKGROUND

With the development of globalization, the need for automatic or semi-automatic translation techniques is growing rapidly. The aim of word alignment is to find the translation relation between corresponding words of parallel bilingual text. Word alignment is a fundamental technique underlying statistical machine translation, i.e. the technology of automatically translating text in one language into another by a computer using statistical methods, and it has a significant influence on translation quality. In addition, word alignment supports technologies such as cross-language retrieval.

Inversion transduction grammar is a family of bilingual synchronous grammars proposed by Dekai Wu. Each instance of an inversion transduction grammar consists of several grammar rules, which define the transduction relations among a particular set of symbols. A pair of start symbols may be synchronously rewritten into a pair of new strings by applying these rules. The inversion transduction grammar requires that each grammar rule be in one of the following six forms:

-   S→ε/ε
-   A→x/ε
-   A→ε/y
-   A→x/y
-   A→[B C]
-   A→<B C>

where ε denotes the empty string. Each of the first four rules states that the symbol on the left of the arrow generates the symbol before the slash in the first language and the symbol after the slash in the second language; for example, the 4th rule states that symbol A generates x in the first language and y in the second language. The 5th rule states that symbol A generates the string “B C” in both languages at the same time, while the 6th rule states that symbol A generates the string “B C” in the first language and the string “C B” in the second language at the same time.
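As a purely illustrative example, not drawn from the embodiments below, consider a hypothetical grammar fragment containing the rules A→<B C>, B→red/rouge and C→car/voiture. Applying A→<B C> produces the string “B C” in the first language and “C B” in the second language; applying the two lexical rules then yields “red car” in the first language and “voiture rouge” in the second, so the inverted rule of the 6th form captures the word-order difference between the two languages.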

Existing research shows that introducing the inversion transduction grammar constraint into word alignment (i.e. requiring that the word alignment can be generated by an inversion transduction grammar) significantly improves the quality of word alignment. The computational cost of searching over all possible word alignments that satisfy the inversion transduction grammar constraint is, however, too high for practical use. A few approximate search algorithms have been proposed to reduce the computational cost by searching over only part of the potential word alignments that satisfy the inversion transduction grammar constraint, but the computational cost remains too high for practical use.

SUMMARY

The embodiments of the present invention provide a method and apparatus for performing bilingual word alignment that further reduce the computational cost while maintaining alignment quality and robustness. The technical solution is summarized as follows (in the following description, if two words are considered to have a corresponding relation, we say there is a “link” between these two words):

In one aspect, the invention provides a method for performing bilingual word alignment, comprising:

1. preprocessing the source and target text of the bilingual documents;

2. computing the probability gains of adding a link between any word in the source text and a word in the target text;

3. setting the initial word alignment as an empty alignment, i.e. there is no link between any two words;

4. applying a greedy algorithm to search iteratively for word alignments that satisfy the inversion transduction grammar constraint; and

5. outputting the best word alignment among the alignments found in steps 3-4 that satisfy the inversion transduction grammar constraint as the final alignment result.

In another aspect, the invention provides an apparatus for performing bilingual word alignment, comprising:

a preprocessing module for preprocessing the source and target text of the bilingual documents to be aligned;

a probability computation module for computing the probability gains of adding a link between any source and target words, and selecting all links whose gains are positive;

an initial word alignment generation module for generating an initial word alignment;

a word alignment search module for applying a greedy algorithm to search iteratively for word alignments that satisfy the inversion transduction grammar constraint; and

a word alignment result output module for outputting the best word alignment among the alignments found during the search process that satisfy the inversion transduction grammar constraint as the final alignment result.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to clearly describe the technical solution, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It is obvious that the drawings described below illustrate only some embodiments of the invention; those skilled in the art may obtain other drawings from these drawings without inventive effort.

FIG. 1 is a flow chart of the method for performing bilingual word alignment provided by embodiment 1 of the present invention;

FIG. 2 is a flow chart of the method for performing bilingual word alignment provided by embodiment 2 of the present invention;

FIG. 3 is a schematic diagram of the apparatus for performing bilingual word alignment provided by embodiment 3 of the present invention;

FIG. 4 is a schematic diagram of the word alignment search module structure in the apparatus for performing bilingual word alignment provided by embodiment 3 of the present invention.

DETAILED DESCRIPTION

In order to make the objectives, technical solutions and advantages of the present invention clearer, the implementations of the present invention are described below in further detail with reference to the drawings.

Embodiment 1

Referring to FIG. 1, this embodiment provides a method for performing bilingual word alignment. The workflow is as follows:

101. preprocessing the source and target text of the bilingual documents to be aligned;

102. computing the probability gain of adding a link between any source and target words, and selecting all links whose gains are positive;

103. setting the initial word alignment as an empty alignment, i.e. there is no link between any words;

104. applying a greedy algorithm to search iteratively for word alignments that satisfy the inversion transduction grammar constraint;

105. outputting the best word alignment among the alignments found during the search process that satisfy the inversion transduction grammar constraint as the final alignment result.

The present invention applies a greedy algorithm to search iteratively for the word alignments that satisfy the inversion transduction grammar constraint. As a result, it produces robust word alignment more quickly than previous methods.

Embodiment 2

Referring to FIG. 2, this embodiment provides a method for performing bilingual word alignment. Because it uses a greedy algorithm to search iteratively for word alignments that satisfy the inversion transduction grammar constraint, it can find word alignments of comparable quality more quickly than previous methods. The specific workflow is as follows:

201: preprocessing the source and target text of the bilingual documents;

Specifically, the preprocessing includes, but is not limited to: splitting the source and target text into individual words; removing redundant blank characters; and recording the number of words in the source text (denoted as I) and the number of words in the target text (denoted as J).
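As a minimal illustration only (the embodiment does not prescribe any particular implementation, and the sentence strings used here are hypothetical), the preprocessing step might be sketched in Python as follows:

```python
import re

def preprocess(text):
    # Split a sentence into individual words and drop redundant blank characters.
    return [w for w in re.split(r"\s+", text.strip()) if w]

source_words = preprocess("the  red car")      # hypothetical source sentence
target_words = preprocess("la voiture rouge")  # hypothetical target sentence
I, J = len(source_words), len(target_words)    # number of source / target words
```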

202: computing the probability gain of adding a link between any source word and any target word; the set of links with positive gains is denoted as L;

Specifically, the gain from adding a link between the i^(th) source word e_(i) and the j^(th) target word f_(j) is defined as

$\mathrm{gain}(i,j) = \frac{p\left( f_{j}, e_{i} \right)}{p\left( f_{j}, \varepsilon \right) \times p\left( \varepsilon, e_{i} \right)}$

where p(f_(j), e_(i)) is the probability that e_(i) is aligned to f_(j); p(ε, e_(i)) is the probability that e_(i) is not aligned to any word; and p(f_(j), ε) is the probability that f_(j) is not aligned to any word. These probabilities may be estimated by any existing method, or by any method that may be invented in the future; this embodiment does not limit the estimation method. One possible implementation is as follows: p(f_(j), e_(i)), p(ε, e_(i)) and p(f_(j), ε) may be estimated from the translation probabilities of Model 4 in the paper “The Mathematics of Statistical Machine Translation: Parameter Estimation” by P. F. Brown, et al., specifically

$\begin{matrix} {p\left( f_{j}, e_{i} \right) = \frac{p_{m4}\left( f_{j} \mid e_{i} \right) \times p_{m4}\left( e_{i} \mid f_{j} \right)}{2}} \\ {p\left( f_{j}, \varepsilon \right) = p_{m4}\left( f_{j} \mid \varepsilon \right)} \\ {p\left( \varepsilon, e_{i} \right) = p_{m4}\left( e_{i} \mid \varepsilon \right)} \end{matrix}$

where p_(m4)(f_(j)|e_(i)), p_(m4)(e_(i)|f_(j)), p_(m4)(f_(j)|ε) and p_(m4)(e_(i)|ε) are the translation probabilities of Model 4 proposed by Brown, et al.

203: setting the initial word alignment as an empty alignment, i.e. there is no link between any words;

204: adding the initial word alignment to the pending list OPEN;

205: judging whether the pending list OPEN is empty, i.e. whether OPEN contains no word alignment; if it is empty, step 211 is performed; otherwise, step 206 is performed;

206: setting the local pending list CLOSE as an empty list;

207: judging whether the pending list OPEN is empty; if it is empty, step 210 is performed; otherwise, step 208 is performed;

208: picking a word alignment A from the pending list OPEN, and removing A from OPEN; this embodiment does not impose any restriction on the picking strategy;

209: for each link l in L, if l is not in A, denoting by B the word alignment obtained by adding l to A; if B satisfies the inversion transduction grammar constraint, B is added to the local pending list CLOSE; if the number of word alignments in CLOSE is more than b, only the best b word alignments in CLOSE are kept;

This step may use any existing method, or any method that may be invented in the future, to determine whether a word alignment satisfies the inversion transduction grammar constraint.
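Purely as an illustrative sketch (this embodiment does not prescribe any particular check), the following Python function decides compatibility with the inversion transduction grammar constraint for the special case in which the links form a one-to-one mapping, using the known characterization that such a permutation is ITG-compatible if and only if it avoids the patterns 3-1-4-2 and 2-4-1-3; handling unaligned or multiply-linked words would require a more elaborate procedure:

```python
from itertools import combinations

def is_itg_permutation(pi):
    # pi[i] is the target position linked to source position i (one-to-one case).
    # The permutation is ITG-compatible iff no four positions form the
    # forbidden patterns 3-1-4-2 or 2-4-1-3 (written 0-based below).
    forbidden = {(2, 0, 3, 1), (1, 3, 0, 2)}
    for quad in combinations(range(len(pi)), 4):
        values = [pi[q] for q in quad]
        ranks = tuple(sorted(values).index(v) for v in values)
        if ranks in forbidden:
            return False
    return True

# The "inside-out" permutation 2-4-1-3 cannot be generated by an ITG,
# while a fully inverted prefix followed by a monotone word can.
assert not is_itg_permutation([1, 3, 0, 2])
assert is_itg_permutation([2, 1, 0, 3])
```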

The following formula is used to evaluate the quality of a word alignment; the higher the value of the formula, the better the word alignment:

$\prod\limits_{(j,i) \in a} p\left( f_{j}, e_{i} \right) \times \prod\limits_{i \notin a} p\left( \varepsilon, e_{i} \right) \times \prod\limits_{j \notin a} p\left( f_{j}, \varepsilon \right)$

The formula consists of three parts: the first part is the product of the values of p(f_(j), e_(i)) over all pairs of f_(j) and e_(i) that have a link between them; the second part is the product of the values of p(ε, e_(i)) over all source words that are not aligned to any word; and the third part is the product of the values of p(f_(j), ε) over all target words that are not aligned to any word. The formula is the product of these three parts. The definitions of p(f_(j), e_(i)), p(ε, e_(i)) and p(f_(j), ε) are the same as those in step 202.
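As a minimal sketch only (the probability dictionaries p_pair, p_src_null and p_tgt_null are hypothetical names, and logarithms are used purely to avoid numerical underflow), the evaluation formula above might be computed as follows:

```python
import math

def alignment_log_score(links, I, J, p_pair, p_src_null, p_tgt_null):
    # links: set of (j, i) pairs; I source words, J target words.
    # Returns the logarithm of the product in the formula above; higher is better.
    aligned_src = {i for (j, i) in links}
    aligned_tgt = {j for (j, i) in links}
    score = sum(math.log(p_pair[(j, i)]) for (j, i) in links)
    score += sum(math.log(p_src_null[i]) for i in range(I) if i not in aligned_src)
    score += sum(math.log(p_tgt_null[j]) for j in range(J) if j not in aligned_tgt)
    return score
```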

This embodiment imposes one and only one restriction: b must be a number no smaller than 1. The specific value of b depends on the required word alignment speed and word alignment quality in a practical scenario. The smaller b is, the faster the search and the poorer the quality of the word alignment; the larger b is, the slower the search and the better the quality of the word alignment.

210: setting OPEN=CLOSE, and returning to step 205;

211: outputting the best word alignment among the alignments found during the search process that satisfy the inversion transduction grammar constraint as the final alignment result. This step uses the same standard for evaluating word alignment quality as step 209.
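The following Python sketch puts steps 203-211 together; it is illustrative only, the helper functions score and satisfies_itg are assumed to be supplied (for example as in the sketches above), and candidate_links plays the role of the set L:

```python
def greedy_itg_search(candidate_links, b, score, satisfies_itg):
    best = frozenset()                    # step 203: the empty alignment
    open_list = [best]                    # step 204: pending list OPEN
    while open_list:                      # step 205
        close_list = []                   # step 206: local pending list CLOSE
        while open_list:                  # step 207
            a = open_list.pop()           # step 208: picking strategy is unrestricted
            for link in candidate_links:  # step 209: expand A with each link l in L
                if link in a:
                    continue
                candidate = a | {link}
                if satisfies_itg(candidate):
                    close_list.append(candidate)
        close_list.sort(key=score, reverse=True)
        close_list = close_list[:b]       # keep only the best b alignments
        if close_list and score(close_list[0]) > score(best):
            best = close_list[0]          # remember the best alignment found so far
        open_list = close_list            # step 210: OPEN = CLOSE
    return best                           # step 211: output the best alignment
```

Note that in this sketch the same alignment may be added to CLOSE more than once when it can be reached by adding links in different orders; a practical implementation might deduplicate, but the control flow mirrors the numbered steps above.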

Embodiment 3

Referring to FIG. 3, this embodiment provides an apparatus for performing bilingual word alignment, comprising:

a preprocessing module 301 for preprocessing the source and target text of the bilingual documents to be aligned;

a probability computation module 302 for computing the gains of adding a link between any source and target words, and selecting all links whose gains are positive;

an initial word alignment generation module 303 for generating an initial word alignment;

a word alignment search module 304 for using a greedy algorithm to search iteratively for word alignments that satisfy the inversion transduction grammar constraint; and

a word alignment result output module 305 for outputting the best word alignment among the alignments found during the search process that satisfy the inversion transduction grammar constraint as the final alignment result.

Specifically, the preprocessing module 301 is used for splitting the source and target texts into individual words, removing redundant blank characters, and recording the number of words in the source text (denoted as I) and the number of words in the target text (denoted as J).

In the probability computation module 302, the gain of adding a link between the i^(th) source word e_(i) and the j^(th) target word f_(j) is defined as

$\mathrm{gain}(i,j) = \frac{p\left( f_{j}, e_{i} \right)}{p\left( f_{j}, \varepsilon \right) \times p\left( \varepsilon, e_{i} \right)}$

where p(f_(j), e_(i)) is the probability that e_(i) is aligned to f_(j); p(ε, e_(i)) is the probability that e_(i) is not aligned to any word; and p(f_(j), ε) is the probability that f_(j) is not aligned to any word.
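For illustration only (the dictionary names are hypothetical, and “positive gain” is read here as a gain whose logarithm is positive, i.e. a ratio greater than 1, since that is exactly when adding the link increases the evaluation product), the gain computation and link selection might look like this:

```python
def link_gain(j, i, p_pair, p_src_null, p_tgt_null):
    # gain(i, j) = p(f_j, e_i) / (p(f_j, eps) * p(eps, e_i))
    return p_pair[(j, i)] / (p_tgt_null[j] * p_src_null[i])

def positive_gain_links(I, J, p_pair, p_src_null, p_tgt_null):
    # The set L of links whose gain exceeds 1 (i.e. whose log-gain is positive).
    return {(j, i)
            for j in range(J) for i in range(I)
            if link_gain(j, i, p_pair, p_src_null, p_tgt_null) > 1.0}
```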

In the initial word alignment generation module 303, the initial alignment is set as an empty alignment, i.e. there is no link between any words.

In the word alignment result output module 305, the following formula is used to evaluate the quality of a word alignment:

$\prod\limits_{(j,i) \in a} p\left( f_{j}, e_{i} \right) \times \prod\limits_{i \notin a} p\left( \varepsilon, e_{i} \right) \times \prod\limits_{j \notin a} p\left( f_{j}, \varepsilon \right)$

The formula consists of three parts: the first part is the product of the values of p(f_(j), e_(i)) over all pairs of f_(j) and e_(i) that have a link between them; the second part is the product of the values of p(ε, e_(i)) over all source words that are not aligned to any word; and the third part is the product of the values of p(f_(j), ε) over all target words that are not aligned to any word. The formula is the product of these three parts. The definitions of p(f_(j), e_(i)), p(ε, e_(i)) and p(f_(j), ε) are the same as those in the probability computation module 302.

Furthermore, referring to FIG. 4, the word alignment search module 304 comprises:

a pending list initializing unit 304a, for initializing a pending list as a list that includes only the empty word alignment;

a local pending list generation unit 304b, for expanding the word alignments in the pending list and generating a local pending list;

a pending list resetting unit 304c, for resetting the pending list to the local pending list; and

a branch selecting unit 304d, for determining whether to return to the local pending list generation unit 304b, or to end the word alignment search module 304 and pass control to the word alignment result output module 305.

In the local pending list generation unit 304b, let L be the set of links with positive gains computed in the probability computation module 302. For each word alignment A in the pending list and each link l in L, if l is not in A, let B be the word alignment obtained by adding l to A; if B satisfies the inversion transduction grammar constraint, B is added to the local pending list; if the number of word alignments in the local pending list is more than b, only the best b word alignments in the local pending list are kept.

In the branch selecting unit 304d, if the pending list is not empty, the process returns to the local pending list generation unit 304b; otherwise it enters the word alignment result output module 305.

The above embodiments of the present invention are numbered only for the purpose of description, and the numbering does not indicate the relative importance of any specific embodiment.

Part of the steps in the embodiments of the present application may be implemented by software, and the software may be stored in a computer-readable storage medium, such as a compact disc or a hard disk.

The above embodiments are only preferred example embodiments of the present invention and are not intended to limit the invention. Any modification, substitution, duplication or improvement made within the spirit and principles of the present invention shall be regarded as falling within the protection scope of the present invention.

What is claimed is:
 1. A method for performing bilingual word alignment, comprising: preprocessing source and target texts of bilingual documents to be aligned; computing probability gains of adding a link between any pair of source and target words; setting the initial word alignment as an empty alignment; applying a greedy algorithm to iteratively search for word alignments that satisfy an inversion transduction grammar constraint; and outputting a best word alignment among the alignments that satisfy the inversion transduction grammar constraint as a final alignment result.
 2. The method of claim 1, wherein preprocessing the source and target text of the bilingual documents to be aligned comprises: splitting the source and target text into individual words; removing redundant blank characters; recording a number of words in the source text (denoted as I), and a number of words in the target text (denoted as J).
 3. The method of claim 1, wherein computing the probability gains of aligning any source and target words comprises: the probability gain, gain(i,j), is calculated as a ratio of the probability p(f_(j), e_(i)) to the product of the probability p(f_(j), ε) and the probability p(ε, e_(i)), wherein $\mathrm{gain}(i,j) = \frac{p\left( f_{j}, e_{i} \right)}{p\left( f_{j}, \varepsilon \right) \times p\left( \varepsilon, e_{i} \right)}$ and wherein e_(i) is an i^(th) source word, f_(j) is a j^(th) target word, p(f_(j), e_(i)) is a probability that e_(i) is aligned to f_(j), p(ε, e_(i)) is a probability that e_(i) is not aligned to any word, and p(f_(j), ε) is a probability that f_(j) is not aligned to any word.
 4. The method of claim 1, wherein applying the greedy algorithm to iteratively search for word alignments that satisfy the inversion transduction grammar constraint, comprises: initializing a pending list as a list that only contains an empty word alignment, wherein no word is aligned; generating a local pending list; resetting the pending list as the local pending list; determining whether to continue “generating a local pending list” for further iteration according to whether the pending list is empty or not: if the pending list is empty then stop the iteration, otherwise continue “generating a local pending list” for the next iteration.
 5. The method of claim 4, wherein generating a local pending list, comprises: setting the local pending list as an empty list; performing an expanding operation to each word alignment in the pending list, and adding to the local pending list new word alignments that meet certain conditions including: if a number of the word alignments in the local pending list is more than b, then b word alignments are kept in the local pending list according to a measured quality, and wherein b is a positive number.
 6. The method of claim 5, wherein performing an expanding operation to each word alignment in the pending list, comprises: wherein L is a set of links that have positive gains, for each word alignment A in the pending list and link l in L, if l is not in A, then add l to A to produce a new word alignment.
 7. The method of claim 5, wherein the certain conditions include: if the word alignment satisfies the inversion transduction grammar constraint, then the word alignment is added into the local pending list; otherwise the word alignment is not added to the local pending list.
 8. The method of claim 1, wherein an evaluating method for a measured quality of a word alignment comprises a product of (a) all probability values p(f_(j), e_(i)) for pairs (j,i) that are members of set a, i.e. for any pair of f_(j) and e_(i) that have a link between them, (b) all probability values p(ε, e_(i)) for source words e_(i) that are not members of set a, i.e. for any source word that is not aligned to any word, and (c) all probability values p(f_(j), ε) for target words f_(j) that are not members of set a, i.e. for any target word that is not aligned to any word, wherein p(f_(j), e_(i)) is a probability that e_(i) is aligned to f_(j); p(ε, e_(i)) is a probability that e_(i) is not aligned to any word; and p(f_(j), ε) is a probability that f_(j) is not aligned to any word, and wherein the evaluating method computes $\prod\limits_{(j,i) \in a} p\left( f_{j}, e_{i} \right) \times \prod\limits_{i \notin a} p\left( \varepsilon, e_{i} \right) \times \prod\limits_{j \notin a} p\left( f_{j}, \varepsilon \right)$.
 9. An apparatus for performing bilingual word alignment, comprising: a preprocessing module for preprocessing a source and a target text of bilingual documents to be aligned; a link gain computation module for computing a probability gain of adding a link between any source and target words, and selecting all links whose gains are positive; an initial word alignment generation module for generating an initial word alignment; a word alignment search module for using a greedy algorithm to iteratively search for word alignments that satisfy an inversion transduction grammar constraint; and a word alignment result output module for outputting a best word alignment measured by quality among alignments that satisfy the inversion transduction grammar constraint as a final alignment result.
 10. The apparatus of claim 9, wherein the preprocessing module splits the source and target text into individual words and removes redundant blank characters; and records a number of words in the source text (denoted as I), and a number of words in the target text (denoted as J).
 11. The apparatus of claim 9, wherein computing the gains of adding a link between any source and target word comprises: the probability gain, gain(i,j), is calculated as a ratio of the probability p(f_(j), e_(i)) to the product of the probability p(f_(j), ε) and the probability p(ε, e_(i)), wherein $\mathrm{gain}(i,j) = \frac{p\left( f_{j}, e_{i} \right)}{p\left( f_{j}, \varepsilon \right) \times p\left( \varepsilon, e_{i} \right)}$ and wherein e_(i) is an i^(th) source word; f_(j) is a j^(th) target word; p(f_(j), e_(i)) is a probability that e_(i) is aligned to f_(j); p(ε, e_(i)) is a probability that e_(i) is not aligned to any word; and p(f_(j), ε) is a probability that f_(j) is not aligned to any word.
 12. The apparatus according to claim 9, wherein the word alignment searching module comprises: a pending list initializing unit, for initializing a pending list as a list only including empty word alignment; a local pending list generation unit, for expanding word alignments in the pending list, and generating a local pending list; a pending list resetting unit, for resetting the pending list into the local pending list; and a branch selecting unit, for determining whether to return to the local pending list generation unit, wherein if the pending list is not empty, control returns to the local pending list generation unit, and wherein control otherwise does not return to the local pending list generation unit.
 13. The apparatus of claim 12, wherein the local pending list generating unit is configured to perform the steps that include: performing an expanding operation to each word alignment in the pending list, and adding new word alignments that meet certain conditions into the local pending list, including: if a number of the new word alignments in the local pending list is more than b, then b word alignments are kept in the local pending list according to a measured quality.
 14. The apparatus of claim 13, wherein the performing an expanding operation, includes: wherein L is a set of links which have positive gains, for each word alignment A in the pending list and link l in L, if l is not in A, then add l to A to produce a new word alignment.
 15. The apparatus of claim 13, wherein the certain conditions include: if the word alignment satisfies the inversion transduction grammar constraint, then the word alignment is added to the local pending list.
 16. The apparatus according to claim 9, wherein an evaluating method for a measured quality of a word alignment comprises a product of all probability values p(f_(j), e_(i)) for pairs (j,i) that are members of set a, i.e. for any pair of f_(j) and e_(i) that have a link between them, all probability values p(ε, e_(i)) for source words e_(i) that are not members of set a, i.e. for any source word that is not aligned to any word, and all probability values p(f_(j), ε) for target words f_(j) that are not members of set a, i.e. for any target word that is not aligned to any word, and wherein p(f_(j), e_(i)) is a probability that e_(i) is aligned to f_(j); p(ε, e_(i)) is a probability that e_(i) is not aligned to any word; and p(f_(j), ε) is a probability that f_(j) is not aligned to any word, and wherein the evaluating method computes $\prod\limits_{(j,i) \in a} p\left( f_{j}, e_{i} \right) \times \prod\limits_{i \notin a} p\left( \varepsilon, e_{i} \right) \times \prod\limits_{j \notin a} p\left( f_{j}, \varepsilon \right)$.